Assembly Language
© Copyright Brian Brown, 1988-2000. All rights reserved.
This module is the individual work of Brian Brown. It may not be copied or used in any form without his permission.
OBJECTIVE
The study of advanced micro-processor architectures will
aid the student's understanding of complex systems and enable efficient
software production.
INTRODUCTION 32bit micros (68020/30/40, iAPX286/386/486)
The common
characteristics of 32bit micro-processors are,
Instruction pre-fetching is a technique which fills the processor's internal instruction queue whilst it is busy decoding/executing the current instruction. The idea is to transfer program instructions from system memory into high speed processor storage, so the processor can run as fast as possible without wait-states.
The trend in modern processors is to separate the decode/execution logic from the bus interface unit (which controls access to the system busses). This allows the BIU to fetch the next instruction whilst the DEU is handling the current instruction. An instruction queue links the two separated units together. The BIU tries to keep the instruction queue fully loaded, whilst the DEU tries to get its next instruction from the instruction queue. Immediate-type instructions execute faster with this approach. Other memory access instructions (like direct addressing) require the DEU to ask the BIU to perform the operand fetch/write on its behalf.
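The BIU/DEU arrangement can be pictured as a producer and consumer linked by the queue. Below is a minimal Python sketch of that idea (the queue depth of 6 bytes matches the 8086; real hardware fills and drains the queue in parallel every clock, which this sequential model only approximates):

```python
from collections import deque

QUEUE_DEPTH = 6   # e.g. the 8086 used a 6-byte prefetch queue

def run(program, queue_depth=QUEUE_DEPTH):
    """Alternate BIU fetches (fill the queue) with DEU consumption."""
    queue = deque()
    fetched = 0
    executed = []
    while len(executed) < len(program):
        # BIU: top up the queue while there is room and code left to fetch
        while len(queue) < queue_depth and fetched < len(program):
            queue.append(program[fetched])
            fetched += 1
        # DEU: take the next instruction from the head of the queue
        executed.append(queue.popleft())
    return executed

print(run(["MOV", "ADD", "JMP"]))   # → ['MOV', 'ADD', 'JMP']
```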
CPU Performance can be increased by
In general, options 1) and 3) have limits, which are quickly reached. The easy option is to increase the clock speed, thus having smaller cycle times. However, this requires faster RAM for the processor, which is costly. Wait states are used to interface DRAM to processors where the DRAM cannot respond at the speed of the processor. The CPU holds the address and control lines for extra clock cycles to give the DRAM sufficient time to respond.
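The cost of wait states is easy to quantify. Assuming (purely for illustration) a 25MHz clock (40ns period) and a basic two-clock bus cycle, each wait state adds one clock period to every memory access:

```python
# Hypothetical figures: 25 MHz clock (40 ns period), 2-clock bus cycle.
clock_ns = 40
bus_cycle_clocks = 2

def effective_access_ns(wait_states):
    # Each wait state stretches the bus cycle by one clock period
    return (bus_cycle_clocks + wait_states) * clock_ns

print(effective_access_ns(0))   # 80 ns with zero wait states
print(effective_access_ns(1))   # 120 ns with one wait state
```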
To build a system with zero wait state memory requires
CPU pipelining divides the DEU logic into several stages. These separate stages allow the processor to decode/execute several instructions at the same time. Immediate operands are found in the previous DEU stage, so such an instruction can be executed without any reference to external memory or the internal instruction queue. In fact, several register-type instructions could all be executed by different DEU stages at the same time, resulting in certain instructions having an effective execution time of zero clocks.
Cache memory is high speed memory (on- or off-chip) which interfaces between the CPU and system DRAM. The CPU can access the instructions/data stored inside cache at a faster rate than those stored in system memory. A cache controller is used to pre-fetch instructions in order to keep the cache relatively full. This allows the processor to run at full speed (zero wait states).
Most cache systems run at about 98% efficiency, which means that 98% of all CPU requests for instructions/data are found in cache memory. The size of the cache is often the determining factor in this. The cache cannot be made too large, else it can overload the system bus, cost too much, and might never be full.
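The effect of the hit rate on average access time is a simple weighted average. Assuming (for illustration only) 20ns cache and 100ns DRAM, a 98% hit rate gives:

```python
def average_access_ns(hit_rate, cache_ns, dram_ns):
    # Weighted average: hits are served from cache, misses from DRAM
    return hit_rate * cache_ns + (1 - hit_rate) * dram_ns

# Hypothetical timings: 20 ns cache, 100 ns DRAM, 98% hit rate
print(average_access_ns(0.98, 20, 100))   # → 21.6 ns on average
```

Note how close the average sits to the raw cache speed; this is why a high hit rate matters more than the exact DRAM speed.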
A CACHE HIT means the processor has found the required data/instruction inside the cache. A CACHE MISS is the opposite.
The cache works by using an Address Tag Comparator. This is a hardware counter used by the cache which is initially set to the same address as that output by the processor. On the first memory read by the CPU, the ATC loads itself with the same address. Whilst the processor is decoding/executing that instruction, the cache hardware controller accesses subsequent memory locations, gradually filling up the cache table. When the CPU issues the next address, the ATC checks to see if it is within the range of the cache table. If not, the CPU is connected directly to RAM, then the ATC updates itself and clears the cache table entries.
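The tag-comparator scheme just described (a single contiguous block of prefetched locations, flushed on any miss) can be sketched as follows. This is a simplified model of the behaviour described above, not of any particular cache controller:

```python
class SimpleCache:
    """Single-range cache: the ATC holds a base address, and any
    access within [tag, tag + size) is a hit; a miss reloads the tag
    and (conceptually) clears and refills the cache table."""

    def __init__(self, size):
        self.size = size
        self.tag = None          # base address currently cached

    def access(self, address):
        if self.tag is not None and self.tag <= address < self.tag + self.size:
            return "HIT"
        # Cache miss: CPU goes directly to RAM; ATC updates, table cleared
        self.tag = address
        return "MISS"

cache = SimpleCache(size=16)
print(cache.access(0x1000))   # MISS - first reference loads the tag
print(cache.access(0x1004))   # HIT  - within the prefetched range
print(cache.access(0x2000))   # MISS - outside the range, cache flushed
```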
Performance is increased using cache memory, but factors affecting its performance are,
Programs with jump statements impede cache use by continually flushing the cache entries. Some processors support both instruction and data caches; this prevents the instruction cache being flushed by a memory write. When using a data cache, you need to implement some method of handling writes to locations which also appear in cache. The choices are
The write-through method writes to both the data cache entry and system memory simultaneously. This prevents cache and system data getting out of step. The flush-on-delay method gives faster performance, as data written to cache might need to be used in the immediate future (loop count variables etc). It does, however, require more complex hardware to implement.
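The two write policies can be contrasted in a small sketch. The class names and structure here are illustrative only; "write-back" is the common modern name for the flush-on-delay policy described above:

```python
class DataCache:
    """Sketch contrasting write-through with flush-on-delay (write-back)."""

    def __init__(self, policy):
        self.policy = policy      # "write-through" or "write-back"
        self.cache = {}
        self.memory = {}
        self.dirty = set()        # addresses written to cache but not memory

    def write(self, addr, value):
        self.cache[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value    # memory updated immediately
        else:
            self.dirty.add(addr)         # memory updated later, on flush

    def flush(self):
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()

wt = DataCache("write-through")
wt.write(0x10, 99)
print(wt.memory[0x10])        # 99 - memory is always in step

wb = DataCache("write-back")
wb.write(0x10, 99)
print(0x10 in wb.memory)      # False - memory is stale until a flush
wb.flush()
print(wb.memory[0x10])        # 99
```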
Cache memory is typically implemented in conjunction with system DRAM which supports a pulsed CAS access scheme.
Nibble mode access is a way of accessing four sequential bits in a 4x1bit array. On the first access, the RAS and CAS lines are taken low, and the data is read/written in the normal cycle time (say 80ns). If RAS is held low and CAS pulsed, the next three locations can be accessed without going through the required setup times. This means the next three locations can be accessed at a quicker rate (say 35ns). The cache controller provides the necessary timing signals for accessing the system DRAM using pulsed CAS nibble mode.
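Using the example figures above (80ns for the first access, 35ns for each pulsed-CAS access), the saving over four independent accesses is easy to work out:

```python
# Example timings from the text: 80 ns full cycle, 35 ns pulsed-CAS cycle
first_ns, pulsed_ns = 80, 35

def nibble_burst_ns():
    # One full-setup access followed by three pulsed-CAS accesses
    return first_ns + 3 * pulsed_ns

print(nibble_burst_ns())   # 185 ns for four sequential locations
print(4 * first_ns)        # 320 ns if every access paid the full setup time
```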
Memory Interleaving
Is a memory access technique which divides the
system memory into a series of equal sized banks. These banks are expressed in
terms of an interleave factor (eg, 2x, 4x etc).
Data is read 32 bits at a time (so there is no need for A0 and A1). Upon a read, the first word is read using one wait state, whilst the second is read with zero wait states using external address pipelining by the interleave controller. The above diagram shows a two-times interleave.
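Bank selection in such a scheme comes straight from the address bits. With 32-bit wide banks, A0/A1 select the byte within a word, and the next address bit selects the bank, so sequential word accesses alternate between banks. A sketch (the function name is illustrative):

```python
def bank_of(address, banks=2, word_bytes=4):
    # A0/A1 pick the byte within the 32-bit word; the word number
    # modulo the interleave factor picks the bank.
    return (address // word_bytes) % banks

for addr in (0x00, 0x04, 0x08, 0x0C):
    print(hex(addr), "-> bank", bank_of(addr))
# Sequential word addresses land in banks 0, 1, 0, 1 ...
```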
The advantages of interleaved memory (for a system otherwise using single wait-state memory) are,
In essence, it's a bit like pulsed CAS memory, in that the first access runs at normal speed, but subsequent accesses to other banks run at a faster rate. The requirements for memory interleaving are,
In systems using DMA, interleaving has the further advantage of multiple bank access: the CPU can be accessing one bank whilst a DMA device accesses another.
Using a latched address system, the address bus can be changed once the address has been latched/captured by the system memory. The next address can thus be set up whilst the memory is still using the first.
Memory interleaving is directly supported by the iAPX386 processor. It performs address pipelining, which is placing the address of the next bus cycle on the bus before the current bus cycle is finished. This can happen in latched systems, as there are at least a couple of clock cycles from the time the address is latched till the time that data becomes valid. It is during this period that the processor can be instructed to present the address of the next bus cycle on its address pins. It does this when the NA input pin is asserted by the interleave controller.
BANKED MEMORY
The early eight bit computer systems with their
limited address range (64k) quickly ran out of memory. Programmers required
greater amounts of RAM, but the processor could not access this. Several
techniques evolved which increased the physical RAM in the computer system. Some
of these were
Banked memory overcame the problem by assigning multiple banks of memory (all the same size) to a single address range. These banks could be switched in and out of the main address space of the processor by writing to a port or memory address. A bank could be switched in, data copied to it, then switched out again.
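The switch-in/copy/switch-out sequence can be modelled directly. The class below is an illustrative sketch (bank count and window size are arbitrary), showing how the same offset reads different data depending on which bank is currently selected:

```python
class BankedMemory:
    """Several same-sized banks share one window in the address space;
    a write to a control port selects which bank the window maps to."""

    def __init__(self, banks=4, bank_size=0x4000):
        self.banks = [bytearray(bank_size) for _ in range(banks)]
        self.current = 0

    def select(self, n):          # the "write to a port" step
        self.current = n

    def read(self, offset):
        return self.banks[self.current][offset]

    def write(self, offset, value):
        self.banks[self.current][offset] = value

mem = BankedMemory()
mem.select(1)
mem.write(0x100, 0xAA)     # switch bank 1 in and copy data to it
mem.select(0)
print(mem.read(0x100))     # 0 - bank 0 holds different data
mem.select(1)
print(mem.read(0x100))     # 170 (0xAA) - the data is still in bank 1
```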
Typical circuitry was,
Only one output is active low for a specified D0-D2 input combination.
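This is the behaviour of a 3-to-8 decoder (such as a 74LS138): the three inputs select exactly one of eight active-low outputs. A truth-table sketch:

```python
def decode(d):
    """3-to-8 decoder: input d (0..7) drives exactly one output low (0);
    all other outputs stay high (1)."""
    return [0 if i == d else 1 for i in range(8)]

print(decode(3))   # → [1, 1, 1, 0, 1, 1, 1, 1]
```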
In the IBM-PC/XT, this arrangement is called EMS (Expanded Memory Specification), and uses a bank size of 64k managed as a series of pages (from 2k in size upwards). A special software driver (EMM.SYS) is used to manipulate the memory banks, which are normally mapped in the region A0000 to F0000. This is because the iAPX86 has a limited 1mb of addressable memory. By placing commonly used utilities in EMS, more of the 640k system RAM is freed for DOS. Note that software packages such as LOTUS support EMS memory.
EXTENDED MEMORY (AT/386)
Extended memory is placed above the 1mb
boundary of the iAPX86 processor. ATs and 386s can address more memory
(AT=16mb, 386=4gb). This memory is not accessible to DOS (which is limited to
640k), but can be configured for use as a RAMDRIVE (using VDISK.SYS or
RAMDRIVE.SYS).
This memory is accessible only in PROTECTED MODE, but note that some software contains routines written for protected mode to enlarge the DOS workspace (a good example being AUTOCAD).
DYNAMIC BUS SIZING
The 32bit processors can automatically
reconfigure themselves to suit the data bus size of different peripheral devices
on a cycle by cycle basis. This is called Dynamic Bus Sizing. This means that to
access 8bit memory for a 32bit register load, the processor will automatically
run four consecutive bus cycles to obtain the 32bits of required data. Input
pins to the processor can be used to change the size of the data transfer on a
cycle by cycle basis.
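The four-cycle case can be illustrated in software. The sketch below assembles a 32-bit value from four 8-bit reads, using little-endian byte order as on Intel processors (the function and memory layout here are illustrative, not a model of the actual bus logic):

```python
def read32_from_8bit_bus(memory, addr):
    """Assemble a 32-bit value from four consecutive 8-bit bus cycles
    (little-endian: the lowest address supplies the lowest byte)."""
    value = 0
    for i in range(4):                      # four consecutive bus cycles
        value |= memory[addr + i] << (8 * i)
    return value

ram = {0x100: 0x78, 0x101: 0x56, 0x102: 0x34, 0x103: 0x12}
print(hex(read32_from_8bit_bus(ram, 0x100)))   # 0x12345678
```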
APPROACHES TO PERFORMING I/O
I/O MAPPING (Intel series)
Rather than cluttering up the main
address space, Intel processors provide a separate address space for handling
peripheral devices. A special control line IO/M selects the main address map or
the port map.
The signal line is used to generate appropriate address decoding for the chip select signal.
Note that only the low 16 bits of the address bus are used to select one of 65536 ports.
Some advantages are,
Some disadvantages are,
The 386 processor provides an I/O privilege bitmap per task. This privilege map can protect sensitive peripherals from specific tasks.
MEMORY MAPPING (Motorola)
The peripheral devices are located within
the main address space of the processor.
Some advantages are
Some disadvantages are
IO CHANNEL COPROCESSOR (IBM)
To allow concurrent operation of the
CPU and I/O devices requires the use of a special I/O processor. The main CPU
instructs the I/O processor to perform the required data transfer. When the
transfer is completed, the I/O processor informs the main processor of the
status of the operation.
This method frees the main processor to perform other tasks whilst I/O is being done (tasks requesting I/O are blocked by the OS and thus not scheduled for processor time).
Typical features of an I/O channel processor system are
There are two main types of IO channels
Both channels support a number of devices on a bus called a sub-channel.
The selector channel operates in burst mode only. It handles a single sub-channel at a time, and has very high transfer rates. Typically, it controls high speed disk units.
The multiplexor channel handles more than one sub-channel at a time by interleaving requests. It operates in byte and word mode, and supports burst mode at a much lower rate than a selector channel. Typically, it handles devices like printers and character terminals.
Channel Operation
The processor initiates an I/O transfer by
setting up a special IOC program in main memory. It then issues a STARTIO
instruction, which identifies the channel and sub-channel.
The channel then accesses and runs the channel program (the address of which is in location 72). When finished, the channel updates the IO flag in the processor's status register to signal command completion. The processor then checks the channel status register for results.
Each channel gets informed of
The channel is a sophisticated DMA controller.
Processor Architectures
Accumulator
The processor consists of one or more accumulators
which are used for data storage. The majority of instructions deal with
transferring data between memory and the accumulators. Data is manipulated once
it is in the accumulators. An example of this architecture is the MC6802.
Instructions are executed according to the fetch, decode, execute cycle. Instructions are generally single-byte opcodes with 0, 1 or 2 operands.
General Register
The processor consists of a large bank of
registers, which can be used in most of the available addressing modes (as index
or data registers). All registers are the same size, and the majority of
instructions deal with the manipulation of registers.
The instruction size is expanded, with opcodes 1, 2 or 3 bytes long. This reflects the necessity of encoding register and effective address fields into the opcode of the instruction.
The MC68000 family is an example of this architecture, using 3bit fields for encoding the source and destination register fields.
Reduced Instruction Set Computers
Early computer systems were
implemented using an ALU and control system. The control system comprised
discrete logic devices. The interconnections between these logic devices proved
to be very unreliable, and difficult to design and debug.
In the early 1950s, a British pioneer named Maurice Wilkes came up with a system which implemented the control system via a block of fixed memory. Each column in the memory represented a control line, and the 0's and 1's in each row would set a specific state in the ALU. This simplified the design of control systems for processors, and became known as micro-programming.
Several trends accompanied the development of the CISC (complex instruction set computer). These were
Compiler writers wanted complex instructions so that the task of translation was easier and more efficient.
However, in the mid 1970's several researchers began to have doubts about CISC architectures. They believed that the complex instruction set actually reduced the real performance of the processor. It was discovered that the complex instructions were seldom executed, and that simple instructions predominated.
Complex instructions require complex decoding circuits; this leads to costly design and increased silicon space, which tends to slow the processor down (it limits the clock speed, and complex instructions take many clock cycles to execute).
Designers set about rethinking what a processor should do. They came up with the following criteria,
It was argued that advances in design had solved the problems of hardwired control systems (which means higher clock rates); that compilers would generate more efficient code if the instruction set was simple and consistent; and that, since the instructions executed in a single clock cycle, it should outperform CISC processors.
Programs would be longer because of the simpler instruction set, but the speed of the processor would make up for it, so the overall net result would be a lower execution time.
Processors recently developed which adhere to these criteria are called RISC. The advantages are,
Examples of RISC are the MC88100 and IBM RS6000 processors.
Stack Architectures
The processor has a dedicated stack pointer and
all operations are done on the top of the stack.
In a stack machine, there is a sequence of registers that are used in a special way. Imagine that these registers are called A[1], A[2] to A[n].
At the beginning of execution, there are no particular values associated with any of the registers, and a special stack pointer register STP contains the value zero. A load operation bumps STP and copies the contents of a memory location into A[STP].
1: if STP = n, signal a stack overflow
2: else STP = STP + 1
3: A[STP] = data
Conversely, a store operation first checks to see if the registers contain any useful information (if STP > 0). If not, this indicates a stack underflow; else A[STP] is copied to memory and STP is decreased by 1.
Arithmetic is done by taking the contents of the last two occupied registers (A[STP] and A[STP-1]), combining them as specified by the instruction, and placing the result into A[STP-1]. Since two values have been removed, STP is decreased by 1 to point to the result.
Example operation
Consider the calculation of the formula
E = A * B + C * D
The values A and B are loaded onto the stack and their product is left at the top. Assuming that the values of A, B, C and D are 3, 4, 5 and 2 respectively (and stored in locations $50-$53 respectively), the program looks like,
LOAD 50
LOAD 51
MUL
LOAD 52
LOAD 53
MUL
ADD
The first instruction LOAD 50 leaves the stack like
A[1] = 3
STP = 1
The instruction LOAD 51 leaves the stack like
A[1] = 3
A[2] = 4
STP = 2
The instruction MUL leaves the stack like
A[1] = 12
STP = 1
The instruction LOAD 52 leaves the stack like
A[1] = 12
A[2] = 5
STP = 2
The instruction LOAD 53 leaves the stack like
A[1] = 12
A[2] = 5
A[3] = 2
STP = 3
The instruction MUL leaves the stack like
A[1] = 12
A[2] = 10
STP = 2
The instruction ADD leaves the stack like
A[1] = 22
STP = 1
An example of this architecture is an arithmetic co-processor like the 80387.
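The worked example above can be reproduced with a small stack-machine interpreter. This sketch follows the semantics described in the text (LOAD pushes a memory location, MUL/ADD combine the top two entries); the memory and program match the $50-$53 example:

```python
def run(program, memory, n=8):
    """Evaluate a stack-machine program; the stack models A[1]..A[n]
    with STP = len(stack)."""
    stack = []
    for op, *arg in program:
        if op == "LOAD":
            if len(stack) == n:
                raise OverflowError("stack overflow")
            stack.append(memory[arg[0]])
        elif op in ("MUL", "ADD"):
            if len(stack) < 2:
                raise IndexError("stack underflow")
            b, a = stack.pop(), stack.pop()   # A[STP] and A[STP-1]
            stack.append(a * b if op == "MUL" else a + b)
    return stack[-1]

memory = {0x50: 3, 0x51: 4, 0x52: 5, 0x53: 2}
program = [("LOAD", 0x50), ("LOAD", 0x51), ("MUL",),
           ("LOAD", 0x52), ("LOAD", 0x53), ("MUL",), ("ADD",)]
print(run(program, memory))   # → 22, i.e. E = 3*4 + 5*2
```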
Intel Micro-Processors
1971  4004   4 bit   DB=4/AB=8    Nibble wide, 256 bytes
1972  8008   8 bit   DB=8/AB=8    Byte wide, 256 bytes
      8080   8 bit   DB=8/AB=16
1978  8086   16 bit  DB=16/AB=20  1Mbyte, Segmentation (16x64k), 4.7Mhz
1983  80186  16 bit               Built in PIC/Bus controller
1983  80286  16 bit  DB=16/AB=24  16Mbyte, Introduced Protected Mode, 8-12Mhz
1986  80386  32 bit  DB=32/AB=32  4Gbyte, Paging, 16-33Mhz
1989  80486  32 bit  DB=32/AB=32  Same as 386, includes 387, 33-50Mhz